# install packages if not already installed
install.packages("stargazer")
install.packages("ggplot2")
install.packages("plotly")

Lab 1: Evaluating PROGRESA
Background
Poverty is one of the most important social and economic problems in the world. To alleviate it, many solutions have been tried over the centuries. For example, in 1601, the Parliament of England passed the Poor Relief Act, distributing money, food and clothing, and providing work and apprenticeships to the poor. In 1964, the US government created the Food Stamp program, providing food-purchasing assistance for low- and no-income people living in the U.S.
Such programs were criticized because they neither stimulated recipients to invest in human capital nor improved children’s educational outcomes. To address those two criticisms, many countries implemented conditional cash transfer (CCT) programs, i.e., the government transfers money only to low-income persons who meet certain criteria, such as enrolling their children in public schools, going to the doctor regularly and receiving vaccinations. The most famous CCT program was implemented in Mexico in 1997 and was initially called PROGRESA (its current name is Oportunidades).
Scientifically, PROGRESA is very influential because it was combined with a large-scale randomized controlled trial (RCT). In 1997, the Mexican government and the International Food Policy Research Institute defined which villages were eligible to receive the CCT program and randomly picked some villages to receive the transfers starting in 1998 (the treated group) and some villages to receive the transfers only starting in 1999 (the control group). The experimental results were very positive with respect to improving families’ social, educational and economic outcomes.
See Skoufias et al. (2001) for more details on PROGRESA and the experimental results.
These positive results led to a rapid expansion of PROGRESA, which covered 2.6 million Mexican households in 50,000 villages by 2001. In addition, the success of PROGRESA stimulated many other countries to create similar programs (Bangladesh, Brazil, Cambodia, Chile, Colombia, Egypt, Guatemala, Honduras, Indonesia, Jamaica, Nicaragua, Panama, Peru, the Philippines, Turkey and the United States).
In this lab, we will look at the effect of PROGRESA on school enrollment using the data from the randomized controlled trial. Here is a quick summary of the program:
- Eligibility for PROGRESA: Only poor households living in poor, rural villages were eligible.
- PROGRESA Treatment: Cash transfer to the mother of each household conditional on regular school attendance between 3rd and 11th grades for children less than 18 years old. The subsidy was slightly higher in 9th grade than in earlier grades to offset higher opportunity cost for older children, and slightly higher for girls than for boys in 7th through 9th grades to promote gender equality in schooling.
- RCT of PROGRESA: Two-thirds of 506 rural villages were randomly assigned to be treated villages, and all eligible (sufficiently poor) households in such villages were treated. The remaining villages were randomly assigned to be control villages, and no households in those villages were treated.
- Survey Waves: Five surveys were conducted as part of the RCT. The first two survey waves were conducted in October 1997 and March 1998, and were “baseline” surveys, i.e., they were conducted before any household was treated. The last three survey waves were conducted in October 1998, May 1999, and November 1999; these waves are post-randomization, after the treatment began.
- Outcome: The outcome we will focus on is school enrollment, though these surveys contain a wealth of other information.
Data
As a first step, download the data set PROGRESA.csv. The data are from Lee and Shaikh (2014) as posted at the Journal of Applied Econometrics Data Archive, and are only a small part of the full PROGRESA data set; see IFPRI (2018). Also download and read readme.txt, which describes the data.
Load Libraries
Start by loading the following libraries. Install them first if you have not already done so.
| Library | Purpose |
|---|---|
| stargazer | Package by Hlavac (2022) for creating tables. Creates publication-ready tables that are especially well-suited for reporting results in the standard format for academic papers in economics and other social sciences. |
| ggplot2 | Package by Wickham (2016) for creating figures. More complicated to use than the graphing options in base R, but much more powerful and flexible. |
| plotly | Package by Sievert (2020) for creating interactive figures. |
# load packages (make sure they are already installed)
library(stargazer)
library(ggplot2)
library(plotly)

Instructions for installing and loading R libraries include:
Grolemund (2014), Appendix B.
There are other packages that are more popular for producing tables within data science, and often produce very beautiful tables, though not typically in a format standard for academic journals in economics.
The most standard way to create tables in R is with kable from the knitr package, see description here. We will not be using the pipe operator %>% from the magrittr package that is part of tidyverse, but, for those using the pipe operator, the kableExtra package by Zhu (2019) is very popular, see description here. These packages as well as other good options for creating tables are discussed here.
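As a quick illustration of kable (a sketch assuming the knitr package is installed; the data frame `d` below is hypothetical, not from PROGRESA):

```r
library(knitr)  # assumes the knitr package is installed

# a small illustrative data frame (hypothetical values, not PROGRESA data)
d <- data.frame(group = c("Control", "Treated"), mean_school = c(0.87, 0.91))

# kable returns the table as lines of markdown text
kable(d, format = "pipe", digits = 2)
```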
Read in the data
Set working directory. Make sure that PROGRESA.csv is in that directory.
Use `read.csv` to read the downloaded data into a `data.frame`.
setwd("~/Dropbox/ViaX/Workshop_Files_Summer_22/Lab1") #my path, modify for your computer
df_ <- read.csv("PROGRESA.csv", header=TRUE, sep=",") # read csv

To review data frames in R, see:
Grolemund (2014), Section 3.8.

Instructions for reading data into R include:
Grolemund (2014), Appendix D.
Examine the data
Examine the data, make sure we understand the variables and data structure, and look for any issues with the data.
Explore the data with R functions such as `names`, `dim`, `head`, `table` and `summary`.
You can use the help function to see documentation on a function. For example, help(names) gives you the help page for the function names.
Use the codebook readme.txt to understand the definition of each variable.
names(df_) # name of each column (variable) in data frame df_
[1] "wave" "sooloca" "indivill_seq" "villsize_t" "progresa1"
[6] "sooind_id" "sex1" "age1" "hgc1" "poor1"
[11] "school"
dim(df_) # number of rows (observations) and columns (variables) in data frame df_
[1] 74031 11
head(df_) # first 6 rows (observations) of data frame df_

| wave | sooloca | indivill_seq | villsize_t | progresa1 | sooind_id | sex1 | age1 | hgc1 | poor1 | school |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 76 | 1 | 18 | 0 | 10 | 1 | 1 | 1 |
| 1 | 1 | 2 | 76 | 1 | 5 | 0 | 11 | 2 | 1 | 1 |
| 1 | 1 | 3 | 76 | 1 | 30 | 0 | 9 | 2 | 1 | 1 |
| 1 | 1 | 4 | 76 | 1 | 42 | 1 | 14 | 3 | 1 | 0 |
| 1 | 1 | 5 | 76 | 1 | 14 | 1 | 7 | 0 | 1 | 1 |
| 1 | 1 | 6 | 76 | 1 | 8 | 0 | 12 | 2 | 1 | 1 |
table(df_$wave) # frequency counts for variable wave
1 2 3 4 5
14996 12368 15455 15610 15602
table(df_$progresa1) # frequency counts for variable progresa (treatment status)
0 1
27723 46308
table(df_$hgc1) # frequency counts for variable hgc1 (highest grade completed)
0 1 2 3 4 5 6 7 8 9
4669 8863 9685 9876 8926 8187 13296 3858 3049 3622
summary(df_) # descriptive statistics of all variables in data frame df_
wave sooloca indivill_seq villsize_t
Min. :1.00 Min. : 1 Min. : 1.00 Min. : 1.00
1st Qu.:2.00 1st Qu.:131 1st Qu.: 8.00 1st Qu.: 26.00
Median :3.00 Median :253 Median : 18.00 Median : 39.00
Mean :3.06 Mean :254 Mean : 25.63 Mean : 50.26
3rd Qu.:4.00 3rd Qu.:395 3rd Qu.: 34.00 3rd Qu.: 63.00
Max. :5.00 Max. :491 Max. :210.00 Max. :210.00
progresa1 sooind_id sex1 age1
Min. :0.0000 Min. : 1 Min. :0.000 Min. : 0.00
1st Qu.:0.0000 1st Qu.: 3983 1st Qu.:0.000 1st Qu.: 9.00
Median :1.0000 Median : 7894 Median :0.000 Median :11.00
Mean :0.6255 Mean : 7877 Mean :0.485 Mean :11.33
3rd Qu.:1.0000 3rd Qu.:11768 3rd Qu.:1.000 3rd Qu.:14.00
Max. :1.0000 Max. :15669 Max. :1.000 Max. :99.00
hgc1 poor1 school
Min. :0.000 Min. :1 Min. :0.0000
1st Qu.:2.000 1st Qu.:1 1st Qu.:1.0000
Median :4.000 Median :1 Median :1.0000
Mean :4.029 Mean :1 Mean :0.8275
3rd Qu.:6.000 3rd Qu.:1 3rd Qu.:1.0000
Max. :9.000 Max. :1 Max. :1.0000
Notes:
The data has 74031 child-wave observations, not 74031 children, since each child can appear in the data up to 5 times.
The data has an uneven number of observations per wave, in other words, this is an unbalanced panel.
poor1==1 for all observations, so all observations are poor enough to be eligible for PROGRESA. We can therefore drop the variable poor1.
The youngest child is 0 years old, while the oldest is 99 years old. With the code below, we find that, while the vast majority of rows have an age between 6 and 18, 1 row has a child less than 6 years old, and 261 rows report the child being older than 18. These ages are probably data entry errors, and we will later code ages under 6 or over 18 as missing.
sum(df_$age1<6)
# 1
sum(df_$age1>=6 & df_$age1<=18)
# 73769
sum(df_$age1>18)
# 261
sum(df_$age1==99)
# 10

A dummy variable is a variable that always equals \(0\) or \(1\). They are also called logical indicator variables. We typically use them to represent TRUE/FALSE, YES/NO type data – for example, you are over age \(18\) or you are not.
The mean of a dummy variable is the fraction of observations for which the variable equals \(1\). Likewise, the sum of a dummy variable is the number of observations for which the variable equals \(1\).
The following code defines a new dummy variable for being over age 18, and computes the number of observations over age 18 and the fraction of observations over age 18.
df_$Dummy1 <- df_$age1>18 #Create dummy variable for age1>18
sum(df_$Dummy1) # number of observations with age1>18
mean(df_$Dummy1) # fraction of observations with age1>18

The following code is equivalent, without the step of defining the dummy variable:
sum(df_$age1>18) # number of observations with age1>18
mean(df_$age1>18) # fraction of observations with age1>18

We could further investigate the observations with unusual ages. For example, we could inspect the one row with a reported age of \(0\), and see that the 0-year-old child has already completed first grade, making it unlikely that the reported age of zero is correct:
df_[df_$age1==0,]

| | wave | sooloca | indivill_seq | villsize_t | progresa1 | sooind_id | sex1 | age1 | hgc1 | poor1 | school |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 44394 | 4 | 44 | 10 | 14 | 0 | 1586 | 0 | 0 | 1 | 1 | 1 |
The observation’s child id is 1586 and we could inspect the other waves for the same child.
df_[df_$sooind_id==1586,]

| | wave | sooloca | indivill_seq | villsize_t | progresa1 | sooind_id | sex1 | age1 | hgc1 | poor1 | school |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1418 | 1 | 44 | 3 | 13 | 0 | 1586 | 0 | 7 | 1 | 1 | 1 |
| 28916 | 3 | 44 | 11 | 14 | 0 | 1586 | 0 | 99 | 1 | 1 | 1 |
| 44394 | 4 | 44 | 10 | 14 | 0 | 1586 | 0 | 0 | 1 | 1 | 1 |
| 60001 | 5 | 44 | 2 | 14 | 0 | 1586 | 0 | 7 | 1 | 1 | 1 |
We see that the same child was coded as having the following ages:
- Wave 1, age 7;
- Wave 3, age 99;
- Wave 4, age 0;
- Wave 5, age 7.
The child’s age could not jump from 7 to 99 between waves 1 and 3, nor drop from 99 to 0 between waves 3 and 4, so the ages of 0 and 99 are clearly coded incorrectly. We could guess that the child is actually 7 years old in all the waves (imputing the ages in waves 3 and 4 based on the ages in waves 1 and 5), but here we will simply code the 0 and 99 as missing later on.
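As an aside, a minimal sketch of the imputation alternative (not the approach we take here) could replace a child’s implausible ages with that child’s most common plausible age; the toy vector below mimics the four waves shown above:

```r
# toy data: one child's reported age across four waves (mimicking the case above)
ages <- c(7, 99, 0, 7)

# the child's most frequently reported plausible age (between 6 and 18)
plausible <- ages[ages >= 6 & ages <= 18]
mode.age <- as.numeric(names(which.max(table(plausible))))

# replace implausible ages with that modal plausible age
ages.imputed <- ifelse(ages >= 6 & ages <= 18, ages, mode.age)
ages.imputed
# 7 7 7 7
```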
Number of Observations
The number of observations plays a critical role in statistical analysis, including our choice of research methodology. The more (independent) observations:
- the more precise the estimates;
- the better the approximations coming from the central limit theorem and other asymptotic results;
- the higher the power of tests to reject false hypotheses, and the narrower our confidence intervals;
- the more flexible the models that we can feasibly fit to the data.
We have 74031 rows in our data frame. It is a panel data set, where each row is a child-wave observation, with each child appearing in the data frame up to \(5\) times. We shouldn’t think of the number of child-wave observations as the number of independent observations. For example, a given child being a boy in wave one is not independent of the same child being a boy in wave two. We might think that independence across children is reasonable, or at least a reasonable approximation, and thus might consider our number of observations to be the number of children when thinking about statistical precision or the accuracy of asymptotic approximations. On the other hand, the experiment assigned treatment at the village level, and we might think children within a village (certainly within the same family) are not independent. In that case, we should treat the number of villages as the number of observations when considering statistical precision and asymptotic approximations.
Recall:
- Child id is `sooind_id`.
- Village id is `sooloca`.

Using `nrow`, we find that the data has:
- 74031 rows (child-wave observations).

Using `length` and `unique`, we find that the data has:
- 15669 children, of which 9799 are treated and 5870 are control.
- 491 villages, of which 308 are treated and 183 are control.
nrow(df_) # Number of rows (child-wave observations)
# 74031
length(unique(df_$sooind_id)) #Number of children
# 15669
length(unique(subset(df_,df_$progresa1==1)$sooind_id)) #Number of treated children
# 9799
length(unique(subset(df_,df_$progresa1==0)$sooind_id)) #Number of control children
# 5870
length(unique(df_$sooloca)) #Number of villages
# 491
length(unique(subset(df_,df_$progresa1==1)$sooloca)) #Number of treated villages
# 308
length(unique(subset(df_,df_$progresa1==0)$sooloca)) #Number of control villages
# 183

`unique` is a function whose argument is a vector (or data.frame) and which returns the unique values of that vector. For example:
a<- c(1,1,2,2,3,4,5)
unique(a)
# 1 2 3 4 5
b<- c("Ed","Joe","Ed","Lisa","Joe","Karen")
unique(b)
# "Ed" "Joe" "Lisa" "Karen"

`length` is a function whose argument is a vector and which returns the length of that vector. For example:
length(a)
# 7
length(b)
# 6
a.0 <- unique(a)
b.0 <- unique(b)
length(a.0)
# 5
length(b.0)
# 4

In R one can nest functions. For example, length(unique(a)) is a nested function, and should be interpreted from the innermost function (innermost parentheses) outward. First the function unique is applied to the vector a, finding the unique elements of a, and then length finds the length of the resulting vector and thus the number of unique elements of a. The following code is equivalent:
length(unique(c(1,2,1,2,3,4,4,5)))
# 5
a<- c(1,2,1,2,3,4,4,5)
a.0 <- unique(a)
length(a.0)
# 5

Likewise, the code length(unique(subset(df_,df_$progresa1==0)$sooind_id)) has nested functions, and should be interpreted from the innermost function (innermost parentheses) outward. In particular:
- `subset(df_,df_$progresa1==0)` takes the subset of the data frame df_ with `progresa1==0` (untreated observations).
- `subset(df_,df_$progresa1==0)$sooind_id` takes the column (variable) sooind_id (child id) from the resulting data frame.
- `unique(subset(df_,df_$progresa1==0)$sooind_id)` takes the unique elements of that vector, in other words, the unique child id numbers from the data frame of control observations.
- `length(unique(subset(df_,df_$progresa1==0)$sooind_id))` takes the length of that vector of unique values, in other words, the number of unique child id numbers among control children.
We can use nested functions, which generally results in more concise code, though the code can potentially be confusing. The following code is equivalent:
length(unique(subset(df_,df_$progresa1==0)$sooind_id))
#5870
c.0 <- subset(df_,df_$progresa1==0)$sooind_id
c.1 <- unique(c.0)
length(c.1)
#5870

Preparing data for analysis
Before we actually start analyzing the data, we will prepare it for analysis. In particular, we will complete the following steps:
- Change the names of variables for convenience: `progresa1` to `treat`, and `sex1` to `girl`.
- Using the `subset` command, restrict the data to the second wave (the last baseline wave) and the fifth wave (the final post-treatment wave), and keep only the relevant variables.
- Using the `ifelse` command, code as missing any age less than 6 or greater than 18.
- Using the `ifelse` command, create an indicator variable for whether an observation is from the post period, i.e., a variable that equals 1 if the observation is from wave 5.
- Create separate data frames for pre- and post-treatment observations.
df_$treat <- df_$progresa1
df_$girl <- df_$sex1
df<-subset(df_, wave==5 | wave==2,select= c(sooloca,sooind_id,age1,hgc1,school,treat,girl,wave))
rm(df_)
df$age1 <- ifelse(df$age1>=6 & df$age1<=18, df$age1, NA) # setting implausible ages to missing
df$post <- ifelse(df$wave==5, 1, 0) # creating dummy variable for post period
dfPre <- df[df$post==0,] # creating new data.frame with only pre treatment observations
dfPost <- df[df$post==1,] # creating new data.frame with only post treatment observations

This data is already fairly “clean”. However, data often needs more extensive preparation for analysis, sometimes called “data wrangling”.
For methods to help with more involved “data wrangling” in R, see either:
- R Workflow by Frank Harrell, an approach based on `Hmisc`.
- R for Data Science by Hadley Wickham and Garrett Grolemund, an approach based on the `tidyverse`.
Balance at Baseline: Summary Table
If the randomization is done correctly, then any differences at baseline (pre-treatment) between those assigned to treatment vs. control are due purely to chance. However, by pure chance, random assignment can still result in differences between the two groups.
For this reason, it is common in experimental papers to produce a table examining whether, on average, before receipt of treatment, those assigned to treatment have similar values of the covariates as those assigned to control. Such tables include mean baseline covariates, separately for those assigned to treatment and those assigned to control. They often also include a standardized difference: the difference between the two group means divided by the pooled estimated standard deviation. The tables often also include t-tests of the null of equal means, though such hypothesis testing in this context is hard to justify (see Bruhn and McKenzie 2009, Section G).
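To make the standardized difference concrete, here is a small sketch on toy vectors (hypothetical values, not the PROGRESA data):

```r
# toy baseline covariate for control (x0) and treated (x1) observations
x0 <- c(10, 11, 9, 12, 10)
x1 <- c(11, 12, 10, 13, 11)

n0 <- length(x0); n1 <- length(x1)
diff <- mean(x1) - mean(x0) # difference in means

# pooled estimated standard deviation
pooled.sd <- sqrt(((n0 - 1) * var(x0) + (n1 - 1) * var(x1)) / (n0 + n1 - 2))

std.diff <- diff / pooled.sd # standardized difference
```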
Researchers sometimes re-randomize treatment assignment if the baseline is not sufficiently balanced. Doing so typically does improve precision, similar to proper linear regression adjustment for baseline characteristics (Li, Ding, and Rubin 2018; Lin 2013; Negi and Wooldridge 2021). However, it invalidates conventional standard errors and inference procedures, and is inferior to other methods of designing the experiment or adjusting for differences at baseline. See, e.g., Bai (2022).
Create a table reporting the mean of baseline characteristics separately by treatment assignment, the difference in those means, and the standardized difference (dividing the difference by the pooled estimated standard deviation).
| | Control | Treated | Diff. | Std Diff |
|---|---|---|---|---|
| Girl | 0.50 | 0.48 | -0.02 | -0.04 |
| Age | 10.51 | 10.49 | -0.02 | -0.01 |
| Highest Grade | 3.26 | 3.28 | 0.02 | 0.01 |
| Enrolled | 0.87 | 0.87 | 0.002 | 0.01 |
return_mean_by_treatment <- function(x){
means.t<-tapply(x,dfPre$treat,mean)
var.t<-tapply(x,dfPre$treat,var)
return(c(means.t,var.t))
}
vars <- c("girl","age1","hgc1","school")
output <- sapply(dfPre[vars],return_mean_by_treatment)
means <- output[1:2,]
diffs <- output[2,]-output[1,]
N1 <- sum(dfPre$treat)
N0 <- sum(1-dfPre$treat)
pooled.sd <- sqrt(((N0-1)*output[3,]+(N1-1)*output[4,])/(N0+N1-2))
std.diffs <- (output[2,]-output[1,])/pooled.sd
results0 <- t(rbind(means,diffs,std.diffs))
colnames(results0)<-c("Control","Treated","Diff.","Std Diff")
varlabels <- c("Girl","Age","Highest Grade","Enrolled")
rownames(results0)<-c(varlabels)
stargazer(results0, type="html", digits=2, title="Balance at Baseline")

tapply and sapply
tapply allows you to apply a function separately by groups. The general form is tapply(X,INDEX,FUN), which applies the function FUN to the vector X separately by levels of INDEX. For example, tapply(dfPre$age1,dfPre$treat,mean) applies the function mean to the vector dfPre$age1 by levels of dfPre$treat (treated vs. control). Using that code, we find that the mean age of controls is 10.51 and the mean age of treated is 10.49.
tapply(dfPre$age1,dfPre$treat,mean) # mean age by treatment status
0 1
10.51011 10.48640
sapply allows you to apply a function to each element of a list, vector or data.frame. The general form is sapply(X,FUN), which applies the function FUN to each element of X. For example, sapply(dfPre,mean) applies the function mean to each column (each variable) in the data frame dfPre. In the following code, we first define a vector of variable names vars, so that dfPre[vars] is a data frame containing only the variables in vars, and then sapply(dfPre[vars],mean) applies the function mean to each of those variables.
vars <- c("girl","age1","hgc1","school")
sapply(dfPre[vars],mean) # mean of each column in dfPre[vars]
girl age1 hgc1 school
0.4873868 10.4953105 3.2742561 0.8739489
Using tapply and sapply can help you create efficient code and save you time.
You can write your own functions in R. Objects of class “function” in R have the form:
# defining function
function.name <- function(<arguments>){
# Function body, code taking arguments as inputs
}
# calling function
function.name(<arguments>)

The last value computed in the function body is the value returned by the function. For example, the following function adds one to its argument:
f.addone <- function(x){
x+1
}
f.addone(1)
# 2
f.addone(10)
# 11

Defining functions can help make our code more efficient. Consider:
return_mean_by_treatment <- function(x){
means.t<-tapply(x,dfPre$treat,mean)
var.t<-tapply(x,dfPre$treat,var)
return(c(means.t,var.t))
}
vars <- c("girl","age1","hgc1","school")
output <- sapply(dfPre[vars],return_mean_by_treatment)

The above code defines a function named return_mean_by_treatment, which takes a vector as its argument. It performs the following steps:
- calculates the mean of the vector separately by treatment status;
- calculates the variance of vector separately by treatment status;
- returns the means and variances calculated in steps (1) and (2).
Using sapply, the code then applies that function to each column in dfPre[vars].
Balance at Baseline: School Enrollment at Each Grade
Consider balance in school enrollment at each grade level. In particular, create a figure showing fraction enrolled in school at each grade by treatment assignment. Create the figure separately for boys and girls.
To learn more about ggplot, see http://r-statistics.co/ggplot2-Tutorial-With-R.html.
MeanSchC.s <- with(subset(dfPre, treat==0),tapply(school, list(hgc1,girl), mean))
MeanSchT.s <- with(subset(dfPre, treat==1),tapply(school, list(hgc1,girl), mean))
Grade <- as.factor(rep(c(1:10),2))
Group <- c(rep(" Control", 10), rep("Treated", 10))
MeanSch.b <- matrix(c(MeanSchC.s[,1],MeanSchT.s[,1]), nrow = 20, ncol = 1)
MeanSch.g <- matrix(c(MeanSchC.s[,2],MeanSchT.s[,2]), nrow = 20, ncol = 1)
tab.b <- data.frame(Grade, Group, MeanSch.b)
tab.g <- data.frame(Grade, Group, MeanSch.g)
plotPre.b <- ggplot(tab.b, aes(x = Grade, y = MeanSch.b, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(breaks = seq(from = 0, to = 1, by = 0.1)) +
xlab("Grade Level") +
ylab("Mean School Enrollment")+
ggtitle("Boys: School Enrollment by Grade, Pre Period")
plotPre.g <- ggplot(tab.g, aes(x = Grade, y = MeanSch.g, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(breaks = seq(from = 0, to = 1, by = 0.1)) +
xlab("Grade Level") +
ylab("Mean School Enrollment")+
ggtitle("Girls: School Enrollment by Grade, Pre Period")

ggplotly(plotPre.b,tooltip="y")
ggplotly(plotPre.g,tooltip="y")

Using Post Data to Estimate Effect
Now, use mean differences on post-treatment data to estimate the effect of PROGRESA on school enrollment. By random assignment, the mean-difference estimator does not suffer from selection bias. Furthermore, we can take mean differences conditional on any covariate that is not affected by the treatment, including any baseline characteristic, to estimate a conditional average treatment effect.
- Overall: 0.041.
- By sex:
  - For boys: 0.022.
  - For girls: 0.06.
mean(dfPost[ dfPost$treat==1,"school"]) - mean(dfPost[ dfPost$treat==0,"school"])
# 0.04063189
mean(dfPost[dfPost$treat==1&dfPost$girl==0,"school"])-mean(dfPost[dfPost$treat==0&dfPost$girl==0,"school"])
# 0.02215276
mean(dfPost[dfPost$treat==1&dfPost$girl==1,"school"])- mean(dfPost[dfPost$treat==0&dfPost$girl==1,"school"])
# 0.05975947

Using Post Data to Estimate Effect by Grade Level
Now, create a figure examining the effect of PROGRESA on enrollment by grade level, separately for boys and girls.
Code
MeansC <- with(dfPost,tapply(school,list(treat,girl,hgc1),mean))
dimnames(MeansC)[[1]] <-c("Control","Treated")
dimnames(MeansC)[[2]] <-c("Boys","Girls")
EffSchBoy <- (MeansC[2,1,] - MeansC[1,1,])
EffSchGirl <- (MeansC[2,2,] - MeansC[1,2,])
Grade <- as.factor(c(1:10, 1:10))
EffSch <- matrix(c(EffSchBoy, EffSchGirl), nrow = 20, ncol = 1)
sex <- c(rep("Boy", 10), rep("Girl", 10))
tabEff <- data.frame(Grade, sex, EffSch)
plotEff <- ggplot(tabEff, aes(x = Grade, y = EffSch, fill = sex)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 10) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(breaks = seq(from = -0.1, to = 0.1, by = 0.05)) +
xlab("Grade Level") +
ylab("Mean School Enrollment") +
ggtitle(" Treatment Effects Estimates by Grade and Sex")
ggplotly(plotEff,tooltip="y")

Some results are surprising, e.g., the estimated effect on boys in first grade is -0.068. Presumably, some of these results are driven by random sampling noise. The level of random sampling noise depends on the number of observations (and the level of dependence/independence across observations). With i.i.d. data, \[\mbox{Var}(\bar{X}_N)=\frac{\sigma^2}{N},\] where \(\sigma^2 = \mbox{Var}(X_i)\). Estimating a mean difference on a small subgroup will have more sampling noise than an estimate on the full sample – we expect better precision for the estimate on the full sample than for an estimate for girls entering grade \(2\). In addition, estimating results separately for many subgroups can give rise to spurious results due to data mining. We will return to these issues.
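A quick simulation sketch (with made-up parameters: enrollment probability 0.8 and a true effect of zero) illustrates how sampling noise in a mean-difference estimate shrinks as the number of observations grows:

```r
set.seed(1)

# simulate many mean-difference estimates under a true effect of zero,
# with n observations per group and enrollment probability 0.8
sim.diffs <- function(n, reps = 2000) {
  replicate(reps, mean(rbinom(n, 1, 0.8)) - mean(rbinom(n, 1, 0.8)))
}

sd(sim.diffs(50))   # small subgroup: noisy estimates
sd(sim.diffs(5000)) # large sample: far less noise
```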
No. of Obs. by Grade-Sex-Treatment
Examine:
- The number of boys in each grade by treatment status;
- The number of girls in each grade by treatment status;
- The number of villages with boys in each grade by treatment status;
- The number of villages with girls in each grade by treatment status.
These counts should give us some idea of how much sampling noise to expect for each estimated treatment effect by sex and grade.
Code
person.counts <- with(dfPost,tapply(sooind_id,list(treat,girl,hgc1),function(x){length(unique(x))}))
dimnames(person.counts)[[1]] <-c("Control","Treated")
dimnames(person.counts)[[2]] <-c("Boys","Girls")
Grade <- as.factor(rep(c(1:10),2))
Group <- c(rep(" Control", 10), rep("Treated", 10))
numbers.boys <- matrix(c(person.counts[1,1,],person.counts[2,1,]), nrow = 20, ncol = 1)
numbers.girls<- matrix(c(person.counts[1,2,],person.counts[2,2,]), nrow = 20, ncol = 1)
tab.boys <- data.frame(Grade, Group, numbers.boys)
tab.girls <- data.frame(Grade, Group, numbers.girls)
plot.n.boys <- ggplot(tab.boys, aes(x = Grade, y = numbers.boys, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Boy Observations")+
ggtitle("Number of Boy Observations, Treated and Control")
plot.n.girls <- ggplot(tab.girls, aes(x = Grade, y = numbers.girls, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Girl Observations")+
ggtitle("Number of Girl Observations, Treated and Control")
village.counts <- with(dfPost,tapply(sooloca,list(treat,girl,hgc1),function(x){length(unique(x))}))
dimnames(village.counts)[[1]] <-c("Control","Treated")
dimnames(village.counts)[[2]] <-c("Boys","Girls")
numbers.boys.v <- matrix(c(village.counts[1,1,],village.counts[2,1,]), nrow = 20, ncol = 1)
numbers.girls.v<- matrix(c(village.counts[1,2,],village.counts[2,2,]), nrow = 20, ncol = 1)
tab.boys.v <- data.frame(Grade, Group, numbers.boys.v)
tab.girls.v <- data.frame(Grade, Group, numbers.girls.v)
plot.n.boys.v <- ggplot(tab.boys.v, aes(x = Grade, y = numbers.boys.v, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Villages with Boy Observations")+
ggtitle("Number of Villages with Boy Observations")
plot.n.girls.v <- ggplot(tab.girls.v, aes(x = Grade, y = numbers.girls.v, fill = Group)) +
geom_col(width = 0.7, position = position_dodge(width=0.8)) +
theme_bw(base_size = 11) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_y_continuous(limits=c(0,1100),breaks = seq(from = 0, to = 1100, by = 100)) +
xlab("Grade Level") +
ylab("Number of Villages with Girl Observations")+
ggtitle("Number of Villages with Girl Observations")

Conclusion
Overall, we find positive estimated effects of PROGRESA on school enrollment, and we estimate that the effect is larger for girls than for boys. The estimated effects vary widely across sex-grade cells, with the most extreme and unexpected results occurring where we have the fewest observations and thus expect the most sampling noise.
The next natural step is to conduct formal inference, for example, to test the null hypothesis of no treatment effect including for subgroups. We will return to hypothesis testing next time.